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The interaction of human social behaviour and transmission is an intriguing 
aspect of the life cycle of respiratory viral infections. Although age-specific 
mixing patterns are often assumed to be the key drivers of the age-specific het- 
erogeneity in transmission, the association between social contacts and 
biologically confirmed infection has not previously been tested at the individual 
level. We administered a questionnaire to participants in a longitudinal cohort 
survey of influenza in which infection was defined by longitudinal paired ser- 
ology. Using a variety of statistical approaches, we found overwhelming 
support for the inclusion of individual age in addition to contact variables 
when explaining odds of infection: the best model not including age explained 
only 15.7% of the deviance, whereas the best model with age explained 23.6%. 
However, within age groups, we did observe an association between contacts, 
locations and infection: median numbers of contacts (or locations) reported by 
those infected were higher than those from the uninfected group in every age 
group other than the youngest. Further, we found some support for the retention 
of location and contact variables in addition to age in our regression models, with 
excess odds of infection of approximately 10% per additional 10 contacts or one 
location. These results suggest that, although the relationship between age and 
incidence of respiratory infection at the level of the individual is not driven by 
self-reported social contacts, risk within an age group may be. 



1. Introduction 

The interaction of social behaviour and infectious disease transmission is a com- 
plex and fascinating process across many host -pathogen systems. Behaviours 
such as, social avoidance, mate choice, monogamy and group size, all influence 
the likelihood of transmission [1]. Even though snapshot data on any one aspect 
of behaviour does not capture the complex feedback within an entire popu- 
lation, good evidence of a causal link between behaviour type and infection 
has been reported many times, based on cross-sectional fieldwork. For example, 
badger Meles meles that move further and more frequently are more likely to be 
infected with bovine tuberculosis [2]. Among human pathogens, high rates of 
biting are associated with increased incidence of vector-borne infection [3] 
and numbers of sexual partners are predictive of an increased risk of HIV infec- 
tion [4]. However, despite some suggestive evidence [5], the link between 
human behaviour and the transmission of respiratory viruses is less clear. 
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Social activities such as face-to-face conversations, skin- 
on-skin contact and the sharing of locations must, to some 
degree, influence an individual's risk of infection with a directly 
transmitted pathogen such as influenza [6-8]. However, the 
potential association between self-reported social contacts and 
the transmission of respiratory pathogens has not previously 
been tested empirically at the level of the individual. Those 
who report more frequent social contact should be at a higher 
risk of infection during an epidemic, baring other considerations 
such as immunity and differences in susceptibility. 

The 2009 influenza pandemic, with substantial levels of 
infection in all but the eldest age groups, provided an ideal 
opportunity to investigate directly the correlation between 
self-reported social contacts, the locations at which those con- 
tacts occurred and infection with a common respiratory 
pathogen. Specifically, we wanted to test the hypothesis 
that the higher rates of infection observed in children rela- 
tive to adults may be partly explained by differences in 
self-reported social contacts. 

2. Material and methods 

(a) Fieldwork 

We embedded an interviewer-led social-contact questionnaire 
(electronic supplementary material, text SI) within the 2009/2010 
Hong Kong serological survey [9]. During telephone calls to sche- 
dule follow-up visits, participants were assigned a random day 
from days between the call and the visit. The nature of the question- 
naire was described, and participants were asked to remember 
where they went and whom they met on their assigned day. Con- 
tact persons with whom subjects had face-to-face conversation or 
skin-on-skin contact were recorded. Details of contacts were also 
recorded, such as the location at which the contact occurred, the 
duration of the meeting and the approximate age of the contact. 
The questionnaire was administered face to face by the research 
team during the follow-up interview, and the forms were designed, 
so that supplemental sheets could be added easily. Hence, parti- 
cipants did not see a finite space that may have indicated 
subconsciously the desired number of contacts, nor the maximum 
number of contacts [10]. 

(b) Infection outcome 

Blood samples were taken at baseline and follow-up interviews. 
Infection was defined by a fourfold or greater rise in microneutrali- 
zation (MN) titres against the 2009 pandemic influenza strain 
A(HINI) A/Califomia/4/2009 (with a final titre of 1 : 40 or greater) 
[9]. The endpoint of our MN assay was the highest serum dilution 
that suppressed cytopathogenic effect after 3 days incubation. 

(c) Base statistical model 

The analyses we present here build on previously published stat- 
istical models of these infection outcomes for which the contact 
data used here were not available [9]. The best model in these 
previous analyses included the age of the participant as a 
linear term, the presence or absence of a child in the household 
and the district of residence. Although we did consider a 
smooth term for age [11], it was not supported once the presence 
or absence of child was also included. 

(d) Classifying exposure variables 

Three groups of additional potentially explanatory variables 
were derived from the social-contact questionnaire and the stan- 
dard questionnaire [9]: three key variables not related to the 



social-contact questionnaire (age of participant, presence or 
absence of a child in the household and district of residence; elec- 
tronic supplementary material, table SI; as described above); 
10 variables derived directly from specific questions about social 
contacts on the day of interest (e.g. number of contacts with dur- 
ation greater than or equal to 10 min; electronic supplementary 
material, table S2) and three summary variables (average age of 
contacts, total minimum contact time and number of locations; 
electronic supplementary material, table S3). 

(e) Capturing group mixing 

Individuals with large numbers of contacts or large numbers of 
contact locations often had a number of group meetings. Therefore, 
the questionnaire was administered so as to capture inherent 
uncertainty in the recall of groups. For example, when participants 
were not certain of the size of a group, they were encouraged to 
respond with a range for the number of people in a group. In 
addition, meetings of known size may have had different duration 
of contacts for subgroups or individuals. Because it was not feas- 
ible to break down fuUy every group contact (while maintaining 
a reasonable interview duration), interviewers recorded a single 
event with multiple durations. Therefore, for each contact variable 
(electronic supplementary material, tables S2 and S3), we inferred 
maximum and minimum values for each study participant from 
their responses. We used maximum, minimum and midpoint 
values (median of maximum and minimum) as potential explana- 
tory variables in our regression analyses. We provide all 
individual-level data used for these analyses in the electronic 
supplementary material, dataset SI. 

(f) Hypothesis-driven best subset search of 
model space 

We designed our statistical analyses to build on previous work in 
which age, district of residence and the presence or absence of 
a child in the household had all been found to be important 
determinants of the odds of infection. In moving on to examine 
self-reported social contacts, our primary hypothesis was that 
age effects in our previous work could be explained by some 
reasonable combination of self-reported social behaviour. 

Every question on the contact questionnaire was asked because 
it may have captured some aspect of that behaviour. However, 
not all the questions were independent of each other, either 
numerically or conceptually. In addition, we were limited to 
approximately 80 positive outcomes and wanted to stay within 
the safe ratio of one parameter for each 10 positives for logistic 
regression. Therefore, we defined a set of 1408 models based 
on our own perception of the relationship between alternate vari- 
ables (electronic supplementary material, section Material and 
methods). We fitted those models to the data as they were and 
then again with the addition of a linear age term and compared 
the results using standard information criteria. 

(g) Hypothesis-agnostic group lasso search of 
model space 

Although our preference was to use a hypothesis-driven 
approach, best subset regression has been criticized because it 
can be biased towards lower p-values and higher effect sizes 
[12]. Therefore, we also analysed our data without pre-judging 
the likely degree of coUinearity in our exposure variables but 
within a framework with an explicit penalty for additional par- 
ameters. We used group lasso logistic regression [13] with A set 
to 4 (approximately natural log of number of groups). In order 
to explore the sensitivity of our model results to outliers in our 
data and to obtain confidence intervals for our non-likelihood 
method, we conducted a number of analyses on bootstrapped 



variants of our data. For each bootstrap dataset, we drew the 
same number of individual records (participants) from our 
data with replacement. Confidence intervals for bootstrap runs 
were the middle 95% point estimates. 

3. Results 

(a) Study population 

The parent study was a community-based serological survey 
of households in Hong Kong, recruited using random digit 
dialling [9]. The first baseline blood sample was taken on 
4 July 2009 and the last on 19 September 2009, whereas the 
first follow-up sample was taken on 11 November 2009 and 
the last taken on 6 February 2010. The 770 participants who 
provided paired sera in the parent study also provided 
their contact diaries in their first follow-up visit (of which 
eight were excluded because their contact diaries were not 
sufficiently complete). Of these eight, only a single 9 year 
old was infected. The ages of the remaining seven uninfected 
excluded individuals ranged from 8 to 73 years. The cohort 
for the current study was made up of 762 individuals of 
whom 77 were defined to have been infected (electronic 
supplementary material, dataset SI). 

(b) Contacts and locations 

The total of the number of contacts reported by all 762 
participants was between 13 730 (sum of minimum values) 
and 14 163 (sum of maximum values), giving a range of 18.0- 
18.6 contacts per person. Adults between 30 and 39 years 
made more contacts than did other age groups in our study 
(figure la), even if subjects with a number of contacts more 
than 100 were excluded. Young children (2-9 years) and 
adults between 50 and 59 years made slightly fewer contacts, 
whereas adults over 70 years of age made considerably fewer 
contacts. Contacts were associative by age: the greatest 
number of contacts in the 2-5 year range were made by partici- 
pants in the first age group; the greatest number of contacts 
in the 6-19 year range were made by participants in the 
10-19 year group and the greatest number of contacts over 
age 65 were made by participants in our study aged 70 or older. 

These data on age-specific contact patterns are consistent 
with the range of patterns reported by PolyMod, a study of 
potentially infectious social contacts in European populations 
[8]. Participants in the present study reported similar absolute 
numbers of contacts as those from European countries at the 
higher end of the PolyMod range, such as Italy and Luxembourg. 
The age-specific trends reported here (rescaled to have the same 
maximum amplitude; electronic supplementary material, figure 
Slh) are similar to those reported for most of the European 
countries such as the Great Britain, Italy and Poland in PolyMod, 
other than that adults between 30 and 39 years had proportion- 
ately more contacts than in other countries. Most countries in the 
European study showed consistently decreasing numbers of 
contacts with increasing age. 

The total number of locations at which potentially infec- 
tious contacts were made by participants in our study was 
between 2335 (sum of minimum values) and 3007 (sum of 
maximum values), giving a range of 3.06-3.95 locations per 
person. This range was substantially wider, in proportionate 
terms, than the range observed for contacts. Variation in 
the number of locations per participant, as measured by 



interquartile range, increased consistently with age until a 
decrease for the eldest age group (figure 2a). By contrast, the 
variability in the number of contacts was highest for participants 
between 10 and 29 years of age and then fell away to higher ages 
(figure 2b, c). 

Numbers of contacts were only weakly correlated with 
numbers of locations, once the number of contacts were 
adjusted to reflect the fact that almost every location included 
at least a single additional contact (although not all did, as 
the same single contact was occasionally met in multiple 
locations). The Spearman rank correlation coefficient between 
contacts and locations was 0.44 (figure 2d). However, given 
that a location can be reported only if at least one contact 
occurs, we also calculated the correlation between numbers 
of contacts minus number of locations with number of 
locations, which dropped to 0.18. Although both values are 
significantly different from 0 (^-values both < 0.001), these 
levels of correlation were lower than might be expected and 
justified the inclusion of both contact and location variables 
in the same statistical model. 

(c) Age, contact variables and infection status 

Based on the raw data, both age and number of contacts were 
correlated with infection status (table 1). However, using the 
best subset approach (see Materials and methods), we showed 
that models including self-reported contact and location vari- 
ables but without participant age were not able to describe 
these individual-level infection data as well as regression 
models that included age as a covariate (electronic supplemen- 
tary material, figure S3 and dataset S2). The best model without 
age (electronic supplementary material, table S4) had an Akaike 
information criteria (AIC) of 439, whereas the worst model with 
participant age had an AIC of 411 (electronic supplementary 
material, dataset S2). The best model not including age 
explained only 15.7% of the deviance, whereas the best model 
with age explained 23.6% of deviance. 

(d) Social contacts within age groups 

The contribution of contacts and locations to the risk of infec- 
tion can be seen explicitly by separating age groups into those 
who were infected and those who were not (figure 3a, h 
and electronic supplementary material, figure S2). Other 
than the youngest age group, the median number of con- 
tacts with duration greater than or equal to 10 min for 
those infected in an age group is always greater than 
for those not infected in an age group (Wilcoxon test, 
pvalue < 10 for non-paired comparison of infected and 
uninfected). Similarly, again with the exception of the young- 
est age group, the median number of locations for those 
infected is greater than for those who were not. However, 
likely due to the lower typical value for number of locations, 
the difference between the infected and uninfected subgroups 
for location was not statistically significant (p-value = 0.45). 
The reversal of the trend in the youngest age group may 
reflect lower accuracy of reported contacts. 

(e) Contacts and locations 

We found some support for models of infection that included 
both contact and location variables in addition to age and 
other baseline variables, over and above those that included 
only age and other baseline variables (table 2). The best of 
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Figure 1. Age-specific contact patterns, {a) Distribution of self-reported contacts by age group for the present study. Values are the average per participant of the 
midpoint between reported maximum and minimum. Vertical black lines indicate binomial confidence bounds for the number of contacts across all four contact age 
classes for each participant age group, {b) Comparison of total number of contacts per age group for eight European countries, as reported in the PolyMod study [8]. 
PolyMod reported both participants and contacts in 5 year age groups starting with 0-4 years. Therefore, other than for the lowest age class, we used linear 
interpolation to generate consistent age boundaries from our less detailed data. For the lowest age class of participants, we report a range of 2-4 years, because 
our youngest participant was 2 years old. We assumed that PolyMod recruited few 0 and 1 year olds, so we did not interpolate for the lowest participant age class. 
We did not ask participants to report their contacts in 5 year age groups, (c) Also compares total contacts between the present study and the PolyMod study. 
However, in (c), although the relative amplitude for each age group for individual populations has been preserved, the amplitude has been rescaled, so that 
the maximum value for each country for any age group is 1. 



these models included the number of locations and the 
number of contacts greater than or equal to 60 min and 
had an AIC of 399.3 (compared with an AIC of 402.3 for 
the baseline model). The estimates of odds ratios of infection 
for number of contacts with duration greater than or equal to 
60 min and number of locations were 1.09 and 1.15, respect- 
ively. In other words, an increase in one location visited 
and one long contact increased the odds of infection by 9% 
and 15%, respectively. In addition, the best 27 models overall 
included contact, location and baseline variables. Out of 
models that included only contact and baseline variables, 
the best model included the number of contacts with duration 
greater than or equal to 10 min. 



The magnitude and significance of the contact and 
location variables were consistent. The increase in odds of 
infection per location was approximately 15% but only of 
borderline significance. The increase in odds of infection 
per 10 contacts was approximately 10%, and was usually sig- 
nificant (at the 5% level; the electronic supplementary 
material, dataset S2). The values of the other parameters 
were consistent across the four best models of each type 
(table 2): all were equal up to two significant figures other 
than the district parameter for New Territories East in the 
best model with contact and location variables. All four of 
the best models were consistent with the data using standard 
goodness-of-fit tests for logistic regression models [16]. 
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Figure 2. Comparison between locations and contacts, [a) Box and whisker plot [14] for total number of reported locations (midpoint of maximum and minimum) 
for participant age groups, [b) Box and whisker plot for total number of reported contacts (midpoint of maximum and minimum) for participant age groups. Note 
that participants with number of reported contacts >100 are not plotted in this chart, (c) This figure is the reproduction of [b] except that this represents the 
subjects with number of reported contacts >100. (cf) Scatter plot of total number of reported locations (midpoint of maximum and minimum) and total number of 
reported contacts (midpoint of maximum and minimum). Above and to the left of the y = x line, most of the participants reported at least as many contacts 
as locations. 



(f) Sensitivity analysis 

Of the 762 participants in this study, 616 characterized the 
day for which they reported contacts as being a typical day. 
We recalculated our regression model results for this subset 
of the cohort (electronic supplementary material, table S5), 
finding very similar results to those presented above. For 
the typical day set of best models, the two different contact 
variables (greater than 10 min and greater than 60 min) 
were replaced by a single variable: number of contacts greater 
than or equal to 30 min. In addition, the location variable in 
the best model with location and not contacts became border- 
line significant, rather than borderline not significant (to the 
5% level). 



We also examined association between these exposure data 
and infection using a hypotheses-agnostic lasso regression 
model [13] (table 2). In general, the lasso regression results are 
supportive of the best subset results. Although the point esti- 
mates for the strength of effect of contact variables (number of 
locations and number of contacts with duration greater than 
60 min) are reduced (as would be expected [12]), they remain 
statistically significant in models selected using group lasso 
regression. Interestingly, some social behaviour parameters 
other than those retained in our best subset analysis are retained 
more frequently in a bootstrap analysis of the group lasso models 
(electronic supplementary material, table S6). However, the boot- 
strap confidence intervals for these tended not to be significant. 
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Figure 3. Infection, age, contacts and locations, [a) Distribution of number of contacts with duration greater than or equal to 10 min per age group and infection 
status. Inset same as main chart but with wider y-axis showing all outliers, (b) Distributions of numbers of locations per age group and infection status. In both 
parts: green denotes not infected and red infected; features follow the usual convention for box and whisker plots [14]. 



Table 1. The number and percentages of subjects who were seroconverted by age group and minimum number of contacts. 
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4. Discussion 

We embedded a questionnaire within a longitudinal serologi- 
cal cohort survey to test the strength of association between 
self-reported contacts and odds of influenza virus infection. 
From the questionnaire, we extracted variables related to 



individual contacts and to the locations at which contacts 
were made. Numbers of locations and contacts reported by 
individuals were only weakly correlated. Among a large set 
of logistic regression models, each with serologically con- 
firmed influenza virus infection as the outcome, we found 
overwhelming support for models that used participant age 



explicitly, over models that used only contact variables: con- 
tact behaviour alone at the level of the individual did not 
explain age-specific odds of infection. Within age groups, 
both number of locations and numbers of contacts increased 
odds of infection (approx. 10% per 10 contacts or per location; 
models that did include these terms received similar support 
to those that did not). These age-specific patterns of infection 
could be seen explicitly in the data: other than the youngest 
age group, the median number of contacts or locations was 
always higher in those infected (in a given age group) than 
in those not infected. 

It is not possible or desirable to consider every possible 
statistical model. There may be spline- or interaction-based 
logistic models that are better able to reproduce the patterns 
seen in these data. However, so far as is ever possible, we 
have sought in this initial report to describe the key features 
of the data and the relationship between our data and a large 
set of models in which we had an interest prior to describing 
the data. While we acknowledge that some of the parameter 
estimates may be considered only marginally significant, we 
achieved good consistency over a variety of regression mod- 
elling techniques. It is noteworthy that the parameters 
retained in our best subset analysis were always significant 
in the lasso regression bootstrap analysis. This does point 
towards a subset of influential data points. However, that is 
not unexpected given the strong right-skew commonly 
observed in the distribution of social-contact variables. All 
individual-level data are provided, so that others can con- 
tinue these lines of investigation (electronic supplementary 
material, dataset SI). 

Although our results are based only on a single population, 
we believe that, as the first robust investigation of the associ- 
ation between self-reported social contacts and biologically 
confirmed respiratory infection [6], these results represent a 
useful addition to population-level evidence [5,17,18]. The 
requirement for age in our regression models indicates clearly 
that factors other than self-reported contacts of individuals — as 
recorded here — drove individual odds of infection. 

There are a number of explanations for the inability of 
social contacts to account for age-specific infection patterns 
in our data using these modelling approaches. First, the 
way that one age group makes contacts and reports them 
may be very different from another age group. Direct obser- 
vation- or proximity-based studies may be able to tease out 
these differences [6]. In addition, it is possible that there 
was a unique profile of susceptibility to the 2009 pandemic 
strain that will not be present for other influenza pandemics. 
If we had studied a population of fully immunologically 
naive individuals across all age groups, we may well have 
seen a different relationship. 

We chose to present these data initially using traditional 
epidemiological approaches for primary analysis and we 
have not fitted any mechanistic transmission models. A sec- 
ondary analysis used a stratified final size model (with 
arbitrary resolution in age and numbers of contacts) to capture 
the contact structure of the population [19] and was able to 
show that the average behaviour of an age group, as measured 
by self -reported social contacts, was an important determinant 
of infection risk. 

Methods of measuring potentially infectious contacts are 
themselves [6] a topic of active investigation. Here, we 
chose a retrospective interviewer-led questionnaire approach. 
Participants were informed in advance, by phone, of the 



nature of the discussion, but they were not given a diary. 
Although there are unavoidable issues with recall-bias, this 
design has many advantages over alternatives: it does not 
directly invade privacy and hence change behaviour; it is 
not overly sensitive to the specific design of the form (because 
the participant did not see the form), contacts within large 
buildings are as likely to be recorded as those in the open 
air, and there is no need to recruit potential contacts into 
the study, so that they can wear study equipment. Perhaps 
most importantly, we are measuring what participants them- 
selves consider to be a contact. Therefore, robust associations 
between self -reported contacts and infection outcomes gener- 
ate, by definition, evidence that could be actionable by 
individuals: people may not note how many other people 
they come within 1 m of in a given day, but they do know 
the number of people they talk to and shake hands with. 

We asked only about a single day immediately prior to 
the follow-up interview. Given that antibody titres take up 
to three weeks to rise after infection [20], it is unlikely that 
we asked about the actual day that people were infected. 
Therefore, we implicitly make the assumption that variation 
between individuals in their social behaviour for both con- 
tacts and locations was stationary in time and, by asking 
about a single day, we obtained a representative snapshot 
of that behaviour. It is possible that self-reported contacts — 
recorded in exactly the same manner we did here — for the 
actual day of exposure may explain variability in the odds 
of infection with greater accuracy. However, our sensitivity 
analysis using the participant's characterization of the day 
in question as being typical or atypical suggests otherwise. 
Therefore, our results motivate two specific refinements of 
this questionnaire-based approach. First, prospective repeat 
questionnaire studies, without biological outcome data, 
would help characterize the degree to which responses 
from the same individual remain constant over time [6]. 
Second, a case -control approach could be used to obtain bio- 
logically confirmed infection outcomes and social-contact 
data immediately prior to the onset of symptoms. 



All study protocols and instruments were approved by the insti- 
tutional review boards of the University of Hong Kong. Written 
informed consent was sought from each participant aged 18 or 
above. Written proxy consent was sought from the head of household 
of all participants aged 17 or below. In addition, the written assent to 
participate was asked from participants of aged 7-17. 
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